Why public datasets?

Working with large, open-access datasets can serve many purposes. It can be an excellent way to explore new ideas, before investing in field-work or experiments. It can be a great way to take local or experimental results and expand them to different ecosystems, places, or landscapes. Or it can be an excellent way to build, validate, and test ecological models on regional or national scales.

So why doesn’t everyone use public data? Well, it’s often collected by a variety of organizations, with different methods, units, and inconsistent metadata. Together, these issues make large public datasets “messy.” Data can be messy in many different ways, but at the basic level it means the data is hard to analyze, not because the data itself is bad, but because the way it is organized is unclear or inconsistent.

In this lab, we will learn some tricks for “tidying” data, making it analysis-ready. We will depend heavily on the tidyverse, an excellent series of packages that make data manipulation beautiful and easy. Because we will be working with Water Quality Portal data, we will also use the excellent dataRetrieval package for downloading data from the Water Quality Portal and the USGS.

Loading key packages

This lab is meant to introduce the incredible variety of tools one can use to clean data. Many of these tools are captured by the tidyverse meta-package (a package of packages), but there are some additional ones that will help us locate our various water quality sites.

library(tidyverse) # Package with dplyr, tibble, readr, and others to help clean coding
library(dataRetrieval) # Package to download data. 
library(sf) #Geospatial package to plot and explore data
library(mapview) #Simple interface to leaflet interactive maps
library(broom) #Simplifies model outputs
library(knitr) #Makes nice tables
library(kableExtra) #Makes even nicer tables
library(lubridate) #Makes working with dates easier
library(ggthemes) #Makes plots prettier
library(tidyr) #Reshapes data; nest() makes multiple simultaneous models easier

Downloading data

For this lab, we’ll explore water quality data in the Colorado River basin as it moves from Colorado to Arizona. All data will be generated through the code you see below; the only external information is knowing the SiteIDs for the monitoring locations along the Colorado River and the water quality characteristic names.

The water quality portal can be accessed with the command readWQPdata, which takes a variety of parameters (like startdate, enddate, constituents, etc…). We’ll generate these rules for downloading the data here.

Download prep

#First we'll make a tibble (a tidyverse table) with Site IDs. These are generally arranged in downstream order from the CO headwaters near Grand Lake. 
colorado <- tibble(sites=c('USGS-09034500','USGS-09069000','USGS-09071100',
                           'USGS-09085000','USGS-09095500','USGS-09152500',
                           'USGS-09180000','USGS-09180500','USGS-09380000'),
                   basin=c('colorado1','eagle','colorado2',
                           'roaring','colorado3','gunnison',
                           'dolores','colorado4','colorado5'))

#Now we need to setup a series of rules for downloading data from the Water Quality Portal. 
#We'll focus on cation and anion data from 1980 to present. Each constituent has a common name like calcium or sulfate, but the name may be different in the Water Quality Portal, so we have to check this website https://www.waterqualitydata.us/Codes/Characteristicname?mimeType=xml to get our names correct. 

parameter.names <- c('ca','mg','na','k','so4','cl','hco3')

ca <- c('Calcium')
mg <- c('Magnesium')
na <- 'Sodium'
k <- 'Potassium'
so4 <- c('Sulfate','Sulfate as SO4','Sulfur Sulfate','Total Sulfate')
cl <- 'Chloride'
hco3 <- c('Alkalinity, bicarbonate','Bicarbonate')

#Compile all these names into a single list
parameters <- list(ca,mg,na,k,so4,cl,hco3)
#Name each cation or anion in the list
names(parameters) <- parameter.names
#Notice that we aren't downloading any nutrients (P or N) because they are much messier (100s of different ways to measure and report concentration data) than other cation/anion data. 

#Start and end dates
start <- '1980-10-01'
end <- '2024-03-20'

#Sample media (no sediment samples)
sampleMedia = 'Water'

#Compile all this information into a list with arguments
site.args <- list(siteid=colorado$sites,
                  sampleMedia=sampleMedia,
                  startDateLo=start,
                  startDateHi=end,
                  characteristicName=NA) #We'll fill this in later in a loop

Concentration data download

Now that we have generated the commands to download the data, the code to do so is below, but it is deliberately not run because it takes 15 minutes or so every time. You can always run it yourself by setting eval=TRUE.

conc.list <- list() #Empty list to hold each data download


#We'll loop over each anion or cation and download all data at our sites for that constituent
for(i in 1:length(parameters)){
  #We need to rename the characteristicName (constituent) each time we go through the loop
  site.args$characteristicName<-parameters[[i]]
  
  #readWQPdata takes in our site.args list and downloads the data according to those rules 
  # time, constituent, site, etc...
  
  # Don't forget about pipes "%>%"! Pipes pass forward the results of a previous command, so that 
  #You don't have to constantly rename variables. I love them. 
  
  conc.list[[i]] <- readWQPdata(site.args) %>%
    mutate(parameter=names(parameters)[i]) #Mutate just adds a new column to the data frame
  
  #Pipes make the above command simple and succinct versus something more complicated like:
  
  ## conc.list[[i]] <- readWQPdata(site.args) 
  ## conc.list[[i]]$parameter <- names(parameters)[i]
}

#bind all this data together into a single data frame
conc.long <- bind_rows(conc.list)
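Because the download is slow, a common pattern is to cache the result to disk the first time and reload it afterwards. A minimal sketch using base R's saveRDS/readRDS (the file name and the download_and_bind() wrapper are hypothetical, not part of the lab):

```r
# Generic helper: evaluate `expr` once, cache the result to `file`,
# and reload from disk on every later call. R's lazy evaluation means
# `expr` is never evaluated when the cache file already exists.
run_cached <- function(file, expr) {
  if (file.exists(file)) return(readRDS(file))
  out <- expr          # slow path: only runs on the first call
  saveRDS(out, file)
  out
}

# Usage for this lab (assuming the download loop above is wrapped in a
# hypothetical function download_and_bind()):
# conc.long <- run_cached('conc_long.rds', download_and_bind())
```

This way the expensive Water Quality Portal loop only ever runs once per machine, and re-knitting the document is fast.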

Site info download

We also need to download some site info so we can know where these sites are.

#In addition to concentration information, we probably want to know some things about the sites
#dplyr::select can help us only keep site information that is useful. 

site.info <- whatWQPsites(siteid=colorado$sites) %>%
  dplyr::select(SiteID=MonitoringLocationIdentifier,
                  name=MonitoringLocationName,
                  area=DrainageAreaMeasure.MeasureValue,
                  area.units=DrainageAreaMeasure.MeasureUnitCode,
                  lat=LatitudeMeasure,
                  long=LongitudeMeasure) %>%
  distinct() #Distinct just keeps the first of any duplicates. 

Data tidying

Now that we have downloaded the data we need to tidy it up. The Water Quality Portal data comes with an incredible amount of metadata in the form of extra columns. But we don’t need all this extra data.

Look at the data you downloaded

There are two data types we downloaded: first, site info, which has things like lat and long; and second, concentration data. We already slightly tidied the site info data so that it has sensible column names.

head(site.info)
## # A tibble: 6 × 6
##   SiteID        name                                 area area.units lat   long 
##   <chr>         <chr>                               <dbl> <chr>      <chr> <chr>
## 1 USGS-09380000 COLORADO RIVER AT LEES FERRY, AZ   111800 sq mi      36.8… -111…
## 2 USGS-09034500 COLORADO RIVER AT HOT SULPHUR SPR…    825 sq mi      40.0… -106…
## 3 USGS-09069000 EAGLE RIVER AT GYPSUM, CO             842 sq mi      39.6… -106…
## 4 USGS-09071100 COLORADO RIVER NEAR GLENWOOD SPRI…     NA <NA>       39.5… -107…
## 5 USGS-09085000 ROARING FORK RIVER AT GLENWOOD SP…   1453 sq mi      39.5… -107…
## 6 USGS-09095500 COLORADO RIVER NEAR CAMEO, CO.       7986 sq mi      39.2… -108…

This dataset looks nice because it has all the information we need and nothing extra. Now let’s look at the concentration data.

head(conc.long) %>%
  kable(.,'html') %>%
  kable_styling() %>%
  scroll_box(width='800px',height='300px')
(Output: an extremely wide scrolling HTML table, one row per calcium sample, with 60+ metadata columns ranging from OrganizationIdentifier and ActivityStartDate through CharacteristicName, ResultMeasureValue, ResultMeasure.MeasureUnitCode, analytical method, laboratory, and detection-limit fields, plus our added parameter column.)

Initial cleaning up

Wow, that looks messy! Lots of extraneous columns, lots of NAs, so much information we can hardly parse it. Let’s pare it down to the essentials.

#This code mostly just grabs and renames the most important data columns
conc.clean <-  conc.long %>%
                  dplyr::select(date=ActivityStartDate,
                         parameter=CharacteristicName,
                         units=ResultMeasure.MeasureUnitCode,
                         SiteID=MonitoringLocationIdentifier,
                         org=OrganizationFormalName,
                         org_id=OrganizationIdentifier,
                         time=ActivityStartTime.Time,
                         value=ResultMeasureValue,
                         sample_method=SampleCollectionMethod.MethodName,
                         analytical_method=ResultAnalyticalMethod.MethodName,
                         particle_size=ResultParticleSizeBasisText,
                         date_time=ActivityStartDateTime,
                         media=ActivityMediaName,
                         sample_depth=ActivityDepthHeightMeasure.MeasureValue,
                         sample_depth_unit=ActivityDepthHeightMeasure.MeasureUnitCode,
                         fraction=ResultSampleFractionText,
                         status=ResultStatusIdentifier) %>%
  #Remove trailing white space in labels
  mutate(units = trimws(units)) %>%
  #Keep only samples that are water samples
  filter(media=='Water') #Some of these snuck through!

Now let’s look at the tidier version

head(conc.clean) %>%
  kable(.,'html') %>%
  kable_styling() %>%
  scroll_box(width='800px',height='300px')
(Output: the same samples as above, now reduced to the 17 renamed columns selected into conc.clean.)

Final tidy dataset

Okay, that is getting better, but we still have lots of extraneous information. For our purposes, let’s assume that the sample and analytical methods used by the USGS are reasonable and interchangeable (one method is equivalent to another). If we make that assumption, then the only remaining cleaning step is to make sure all the data has the same units.

Unit Check

table(conc.clean$units)
## 
##  mg/l 
## 17395

All the data is in mg/L!

Since everything is already in the same units, there is nothing to filter out. We just need to select an even smaller subset of useful columns, converting the date column into a proper date object with the lubridate::ymd call along the way.

conc.tidy <- conc.clean %>% 
  mutate(date=ymd(date)) %>%
  select(date,
         parameter,
         SiteID,
         conc=value)
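ymd() is forgiving about input format; it parses year-month-day strings regardless of separator and returns real Date objects that sort and subtract correctly. A quick illustration on toy strings:

```r
library(lubridate)

# Three different year-month-day formats all parse to the same Date class.
d <- ymd(c('1980-10-01', '1980/10/02', '19801003'))

# Date arithmetic just works once parsing succeeds.
d[3] - d[1]   # Time difference of 2 days
```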

Daily data

Okay, now we have a manageable data frame. But how do we want to organize the data? Since we are looking at a long time series (more than 40 years), let’s look at the data as a daily average. The dplyr::group_by and summarize commands make this really easy.

#The amazing group_by function groups all the data so that the summary
#only applies to each subgroup (site, date, and parameter combination).
#So in the end you get a daily average concentration for each site and parameter type. 
conc.daily <- conc.tidy %>%
  group_by(date,parameter,SiteID) %>% 
  summarize(conc=mean(conc,na.rm=T))
## `summarise()` has grouped output by 'date', 'parameter'. You can override using
## the `.groups` argument.

Taking daily averages eliminated 445 observations, meaning some site-date-parameter combinations had multiple observations on the same day.
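To see exactly why averaging collapses rows, here is the same group_by/summarize pattern on a toy tibble with two measurements on one day (the SiteID below is a made-up placeholder, not a real gauge):

```r
library(dplyr)

# Toy data: two measurements on 2000-01-01 and one on 2000-01-02.
toy <- tibble(date      = as.Date(c('2000-01-01', '2000-01-01', '2000-01-02')),
              parameter = 'ca',
              SiteID    = 'USGS-00000000',   # placeholder site
              conc      = c(10, 20, 30))

toy.daily <- toy %>%
  group_by(date, parameter, SiteID) %>%
  summarize(conc = mean(conc, na.rm = TRUE), .groups = 'drop')

# Three raw rows collapse to two daily rows;
# the duplicated day averages (10 + 20) / 2 = 15.
```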

Analyzing data

Now we have a ‘tidy’ dataset. Let’s look at the data we have. First where is the data?

Map

Here we use the sf package to turn the site information into a GIS-style data object called a simple feature (sf). The function st_as_sf converts the long (x) and lat (y) coordinates into a point feature with the coordinate reference system EPSG:4326 (WGS 84). We can then use the mapview package and function to look at where these sites are.

#convert site info as an sf object
site.sf <- site.info %>%
  st_as_sf(.,coords=c('long','lat'), crs=4326)


mapview(site.sf)

So these sites are generally in the Colorado River Basin, with drainage area increasing downstream.

Concentration data

Now that we know where the data is coming from let’s look at the actual data we downloaded using ggplot2

Calcium only

To keep the plots simple at first let’s look at calcium data by site.

conc.daily %>%
  filter(parameter == 'Calcium') %>%
  ggplot(.,aes(x=date,y=conc)) + 
  geom_point() + 
  facet_wrap(~SiteID)

Okay, that’s a lot of data! Maybe too much. Let’s focus on sites with continuous records and then summarize the data by year.

Annual summaries of full sites

Let’s shrink the dataset to only look at annual change.

too.few.years <- c('USGS-09034500','USGS-09071100','USGS-09085000')

conc.annual <- conc.daily %>%
  filter(!SiteID %in% too.few.years) %>% #! means opposite of, so we want all the sites not in the too.few years vector. 
  mutate(year=year(date)) %>%
  group_by(SiteID,year,parameter) %>%
  summarize(annual_mean=mean(conc,na.rm=T),
            annual_var=var(conc,na.rm=T))
## `summarise()` has grouped output by 'SiteID', 'year'. You can override using
## the `.groups` argument.

Plot all the annual data.

conc.annual %>%
  ggplot(.,aes(x=year,y=annual_mean,color=SiteID)) + 
  geom_point() + 
  facet_wrap(~parameter,scales='free')

That plot is… ugly! Maybe we can make something prettier

Prettier annual plot

Having the points colored by SiteID is not super useful unless you have memorized the names and locations of USGS gauges. So maybe we can color by the table we used to download the data? That table, colorado, was organized so that each tributary basin has its own name and the mainstem sites are numbered in downstream order. That’s a better way to think about the data than as SiteIDs. So let’s use join functions to merge the datasets and use the basin names. We’ll also use the ggthemes package to try to make the plots prettier.

conc.annual %>%
  left_join(colorado %>%
              rename(SiteID=sites),by='SiteID') %>%
  ggplot(.,aes(x=year,y=annual_mean,color=basin)) + 
  geom_point() + 
  facet_wrap(~parameter,scales='free') + 
  theme_few() + 
  scale_color_few() + 
  theme(legend.position=c(.7,.15)) + 
  guides(color=guide_legend(ncol=2))

Watershed size

Many prior publications have shown that increasing watershed size leads to decreasing variance in anion and cation concentrations. We can use our dataset to test this in the Colorado basin.

conc.annual %>%
  left_join(site.info,by='SiteID') %>%
  filter(annual_var < 5000) %>%
  ggplot(.,aes(x=area,y=annual_var,color=year)) + 
  geom_point() + 
  scale_x_log10() + 
  facet_wrap(~parameter,scales='free') + 
  theme_few() + 
  theme(legend.position=c(.7,.15)) 
## Warning: Removed 42 rows containing missing values (`geom_point()`).

Cool! Looks like it’s generally true across all constituents. Potassium has low variance. Why? No clue!

Reshaping the data

From basic weathering geochemistry principles, we know that bicarbonate concentrations should be correlated with Mg and Ca, depending on the weathering reactions that generate these river signals. The current ‘long’ shape of the data makes looking at these correlations impossible, so we want to ‘widen’ the data so the parameters sit in side-by-side columns. This is really easy with tidyr’s spread and gather!

head(conc.annual)
## # A tibble: 6 × 5
## # Groups:   SiteID, year [1]
##   SiteID         year parameter annual_mean annual_var
##   <chr>         <dbl> <chr>           <dbl>      <dbl>
## 1 USGS-09069000  1981 Calcium          75      1621   
## 2 USGS-09069000  1981 Chloride         64.7    2441.  
## 3 USGS-09069000  1981 Magnesium        14.1      54.6 
## 4 USGS-09069000  1981 Potassium         2.2       0.79
## 5 USGS-09069000  1981 Sodium           42.6     925.  
## 6 USGS-09069000  1981 Sulfate         151      8103
conc.wide <- conc.annual %>%
  select(-annual_var) %>%
  spread(key=parameter,value=annual_mean) %>%
  mutate(`Mg+Ca`=Magnesium+Calcium)


ggplot(conc.wide,aes(x=Bicarbonate,y=`Mg+Ca`,color=SiteID)) + 
  geom_point() + 
  geom_abline(slope=1,intercept=0)
## Warning: Removed 96 rows containing missing values (`geom_point()`).
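A side note: spread() and gather() still work, but tidyr has since superseded them with pivot_wider() and pivot_longer(). The same widening as above, sketched on a toy long table (not the lab’s data):

```r
library(dplyr)
library(tidyr)

# Toy long-format table: one row per site-year-parameter combination.
long <- tibble(SiteID = 'A', year = 2000,
               parameter = c('Calcium', 'Magnesium'),
               annual_mean = c(50, 10))

# pivot_wider() spreads the parameter names into side-by-side columns,
# after which parameters can be combined arithmetically.
wide <- long %>%
  pivot_wider(names_from = parameter, values_from = annual_mean) %>%
  mutate(`Mg+Ca` = Magnesium + Calcium)
```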

Model changes

It looks to me like there might be some trends in the data at certain sites (Mg and SO4 in particular). Let’s use some more advanced R to check whether there are linear trends in these datasets.

Nesting and modelling

The really excellent purrr and tidyr packages make fitting multiple models to different sites easy. Quick example here:

conc.nest <- conc.annual %>%
  group_by(parameter,SiteID) %>%
  nest() 

head(conc.nest)
## # A tibble: 6 × 3
## # Groups:   parameter, SiteID [6]
##   SiteID        parameter data             
##   <chr>         <chr>     <list>           
## 1 USGS-09069000 Calcium   <tibble [43 × 3]>
## 2 USGS-09069000 Chloride  <tibble [43 × 3]>
## 3 USGS-09069000 Magnesium <tibble [43 × 3]>
## 4 USGS-09069000 Potassium <tibble [43 × 3]>
## 5 USGS-09069000 Sodium    <tibble [43 × 3]>
## 6 USGS-09069000 Sulfate   <tibble [43 × 3]>

That nests or wraps up the data into tidy little bundles. We can access these bundled datasets by using a map function that applies a model inside a bundle.

#Create a generic model function (mean as a function of time)
time.lm.models <- function(x){
  mod <- lm(annual_mean~year,data=x)
}

conc.models <- conc.nest %>%
  mutate(mods=map(data,time.lm.models))

Now we have an individual time-series analysis model for each site and parameter combination. But how do we see the results in a tidy way? Broom to the rescue!

conc.models %>%
  mutate(mod.glance=map(mods,glance)) %>%
  unnest(mod.glance) %>% #Unnesting unwraps the nested column. 
  arrange(desc(adj.r.squared)) %>%
  select(parameter,SiteID,adj.r.squared,p.value,logLik,AIC) %>%
  kable(.,'html') %>%
  kable_styling() %>%
  scroll_box(width='600px',height='500px')
parameter SiteID adj.r.squared p.value logLik AIC
Sodium USGS-09071100 0.8505215 0.0055919 -18.012490 42.024980
Chloride USGS-09071100 0.8216059 0.0080335 -21.025592 48.051183
Sulfate USGS-09071100 0.7682778 0.0137781 -19.488426 44.976853
Calcium USGS-09071100 0.7352453 0.0181757 -14.935576 35.871153
Potassium USGS-09071100 0.7341127 0.0183382 1.949735 2.100531
Magnesium USGS-09071100 0.7292619 0.0190432 -5.301195 16.602390
Chloride USGS-09069000 0.5150982 0.0000000 -184.247218 374.494436
Sodium USGS-09069000 0.4689976 0.0000002 -162.157090 330.314180
Bicarbonate USGS-09071100 0.4033104 0.1045065 -20.543928 47.087856
Magnesium USGS-09380000 0.2431260 0.0003980 -87.055696 180.111391
Potassium USGS-09069000 0.2185269 0.0009279 -7.072389 20.144778
Sulfate USGS-09380000 0.2093011 0.0010565 -198.753679 403.507358
Chloride USGS-09085000 0.2003561 0.0052491 -118.485540 242.971079
Bicarbonate USGS-09069000 0.1926983 0.0099710 -119.634004 245.268008
Potassium USGS-09085000 0.1786655 0.0082675 9.036272 -12.072543
Chloride USGS-09095500 0.1620623 0.0043320 -198.645337 403.290674
Sodium USGS-09085000 0.1541017 0.0137096 -106.503219 219.006439
Bicarbonate USGS-09152500 0.1066049 0.0293191 -157.144426 320.288852
Sodium USGS-09095500 0.0886679 0.0295167 -179.031311 364.062623
Sodium USGS-09380000 0.0805532 0.0346443 -152.634348 311.268696
Potassium USGS-09180500 0.0597829 0.0698577 -44.082684 94.165369
Sulfate USGS-09095500 0.0597640 0.0624041 -185.834372 377.668745
Sulfate USGS-09069000 0.0585130 0.0644717 -194.202439 394.404879
Magnesium USGS-09180000 0.0554354 0.0864063 -134.646260 275.292519
Bicarbonate USGS-09180000 0.0505978 0.2469939 -58.023338 122.046676
Chloride USGS-09380000 0.0454392 0.0882044 -150.416935 306.833870
Sulfate USGS-09180000 0.0400666 0.1226543 -209.718199 425.436398
Magnesium USGS-09095500 0.0271815 0.1480422 -71.487423 148.974846
Magnesium USGS-09180500 0.0103632 0.2426893 -122.367551 250.735102
Potassium USGS-09180000 0.0094519 0.2542619 -128.533001 263.066001
Sodium USGS-09152500 0.0086033 0.2478760 -167.272022 340.544044
Bicarbonate USGS-09085000 0.0063454 0.2905254 -107.639224 221.278449
Sodium USGS-09180500 0.0012575 0.3121898 -178.528316 363.056632
Sulfate USGS-09152500 0.0009586 0.3133711 -248.864741 503.729482
Sodium USGS-09180000 0.0006633 0.3185417 -240.247799 486.495598
Sulfate USGS-09180500 -0.0021020 0.3440066 -228.412545 462.825090
Chloride USGS-09180000 -0.0026215 0.3477389 -260.016625 526.033251
Potassium USGS-09095500 -0.0031343 0.3567518 -62.574697 131.149393
Calcium USGS-09380000 -0.0048760 0.3787597 -132.826832 271.653665
Calcium USGS-09152500 -0.0049965 0.3802952 -187.090774 380.181547
Sulfate USGS-09085000 -0.0072864 0.3874177 -142.531920 291.063840
Bicarbonate USGS-09180500 -0.0095471 0.3750730 -97.978248 201.956496
Bicarbonate USGS-09380000 -0.0100246 0.4421212 -138.559648 283.119296
Calcium USGS-09095500 -0.0103447 0.4545851 -148.601535 303.203071
Potassium USGS-09152500 -0.0127848 0.5026462 -18.446256 42.892512
Potassium USGS-09380000 -0.0135277 0.5174796 -15.934683 37.869366
Calcium USGS-09085000 -0.0142575 0.4638291 -118.370218 242.740436
Chloride USGS-09152500 -0.0162733 0.5797529 -83.449405 172.898810
Calcium USGS-09180500 -0.0195401 0.6181924 -177.349005 360.698010
Magnesium USGS-09069000 -0.0207052 0.7024231 -83.209312 172.418623
Magnesium USGS-09152500 -0.0212916 0.7492079 -136.810540 279.621079
Chloride USGS-09180500 -0.0224410 0.7064398 -186.089596 378.179191
Calcium USGS-09069000 -0.0241495 0.9222661 -152.635248 311.270496
Bicarbonate USGS-09095500 -0.0250871 0.6100680 -127.481956 260.963912
Magnesium USGS-09085000 -0.0269819 0.6925723 -63.192007 132.384013
Calcium USGS-09180000 -0.0273718 0.8409599 -162.070833 330.141667

Cool! Lots of these constituents have significant trends, and many of them are declining. Why? What does that mean for water quality? I don't know, but we could probably use more public data to investigate.

Assignment

The above code is… voluminous! But the goal was to show you the full pipeline of downloading, cleaning, harmonizing, analyzing, and displaying public water quality data. For this assignment you will continue to build out your own ability to analyze large complex data.

Background

The reason so many of these sites in Colorado have lots of chemical concentration data is that the Colorado River is a critical resource throughout the Western USA, but it has suffered from impaired water quality, specifically increases in ion concentrations leading to high salinity. This is particularly true in the Gunnison River. This occurs both because the bedrock is full of readily weathered ions and because irrigation practices artificially elevate weathering rates (more water = more ions released). As such, the Gunnison River has a salinity control plan (https://gunnisonriverbasin.org/water-quality/salinity-control/).

For this assignment you will be exploring changes in salinity in the Gunnison River, Dolores River, and the Colorado River above and below where these two rivers join it (all of this is near Grand Junction). The main questions we have are:

  1. What are trends in specific conductivity (salinity) in these three rivers since 2007?

  2. What is the correlation between specific conductivity and individual ion concentration?

  3. How does discharge control these relationships?

Q2) What is the correlation between specific conductivity and individual ion concentration?

Join conc.daily to your specific conductance (sc) data.

# Think about if you want a full join, inner join, left join, etc...

sc_ions <- sc_focus %>%
  ## Add a new siteid column so it matches conc.daily
  mutate(SiteID = paste0('USGS-',site_no)) %>% 
  #rename the date column so we can join these
  rename(date = Date) %>%
  # join to the conc.daily dataset. We only want matching data here!
  inner_join(conc.daily)
## Joining with `by = join_by(date, SiteID)`
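The message above shows that dplyr guessed the join keys. Supplying them explicitly both documents the intent and silences the message; a minimal sketch using the same data frames:

```r
# Same inner join, with the keys stated explicitly via `by`
sc_ions <- sc_focus %>%
  mutate(SiteID = paste0('USGS-', site_no)) %>%
  rename(date = Date) %>%
  inner_join(conc.daily, by = c("date", "SiteID"))
```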

Plot the correlation with sc on the x axis and individual ions on the y

ggplot(sc_ions, aes(x = SpecCond, y = conc, color = basin))+
  geom_point()+
  facet_wrap(~parameter, scales = "free")+
  ylab("Concentration of Ion (mg/L)") + 
  scale_x_log10() + 
  scale_y_log10()

Use the map and modelling code to test for correlation with lm

conc_SC_lm <- function(x){
  lm(conc ~ SpecCond, data = x) %>%
    tidy()
}

conc_vs_sc_mods <- sc_ions %>%
  group_by(parameter)%>%
  nest() %>%
  mutate(mods=map(data,conc_SC_lm))%>%
  unnest(cols = mods)%>%
  select(-data)%>%
  mutate(p.value = round(p.value, digits = 7))

What ions are correlated with SC? Why might this be?

Everything is correlated with SC! These are all ions, and SC is a measure of the ions in the water, so as the ions in the water (any of these parameters) increase, SC should also increase. :)
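One way to put numbers on this claim: a per-ion Pearson correlation of log-transformed SC and concentration (matching the log-log plot above). A sketch, assuming the sc_ions frame built earlier:

```r
# Per-ion correlation between log10(SC) and log10(concentration)
sc_cors <- sc_ions %>%
  filter(SpecCond > 0, conc > 0) %>%       # logs need positive values
  group_by(parameter) %>%
  summarize(r = cor(log10(SpecCond), log10(conc), use = "complete.obs"))
```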

Q3) How does discharge control these relationships?

Free play.

This section is meant to encourage you to play with the data, you definitely will need to download discharge data at our focus sites, but how you explore this question is up to you. What’s the relationship between discharge and SC? Does this relationship alter the relationship between SC and Ions? Does a model with both SC AND discharge predict water quality concentrations better than SC alone? Just try one of these approaches and write out why you think this might be happening.
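For the third of those questions, one hedged sketch is to fit nested models with and without discharge and compare them; `df` here is a hypothetical data frame with conc, SpecCond, and Flow columns:

```r
# Does SC + discharge predict concentration better than SC alone?
# Compare nested linear models with AIC and an F-test.
compare_mods <- function(df){
  m1 <- lm(conc ~ SpecCond, data = df)
  m2 <- lm(conc ~ SpecCond + Flow, data = df)
  tibble(aic_sc   = AIC(m1),
         aic_sc_q = AIC(m2),
         p_flow   = anova(m1, m2)$`Pr(>F)`[2])
}
```

A lower AIC for the second model (and a small p_flow) would suggest discharge adds predictive power beyond SC.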

# Grab daily discharge (USGS parameter code 00060) using the dataRetrieval package
Q <- readNWISdv(siteNumbers = focus_sites$site_nums,
                       parameterCd = "00060", 
                       startDate = "2007-01-01")%>%
  renameNWISColumns()%>%
  mutate(SiteID = paste0("USGS-", site_no))%>%
  select(SiteID, date = Date, Flow)

all_data <- sc_ions%>%
  left_join(Q, by = c("SiteID", "date"))%>%
  mutate(flow_L_day = Flow * 2446575, # convert cfs to L/day
         daily_load_kg = (conc * flow_L_day)/1000000 )%>% # mg/day to kg/day
  select(SiteID, basin, date, month, year, SpecCond, parameter, conc, Flow, flow_L_day, daily_load_kg)
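The cfs-to-L/day constant in the chunk above can be checked from first principles; a quick sketch:

```r
# Sanity check on the conversion factor:
# 1 ft^3 = 28.3168 L, and there are 86400 seconds in a day
cfs_to_L_day <- 28.3168 * 86400
cfs_to_L_day  # ~2446572, matching the 2446575 used above to within rounding
```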



# Plot Q vs. specific conductance, colored by basin
ggplot(all_data, aes( x = Flow, y = SpecCond, color = basin))+
  geom_point()

# Param daily load vs spec cond
ggplot(all_data,aes( x = SpecCond, y = daily_load_kg, color = basin))+
  geom_point()+
  facet_wrap(~parameter, scales = "free")+
  xlim(0,5000)+
  ylab("Daily Load (kg) for ion")

# param daily load vs time
ggplot(all_data,aes( x = date, y = daily_load_kg, color = basin))+
  geom_point()+
  facet_wrap(~parameter, scales = "free")+
  ylab("Daily Load (kg) for ion")

# param daily load vs time ONLY SUMMER/FALL
ggplot(filter(all_data, month %in% c(7,8,9,10)),aes( x = date, y = daily_load_kg, color = basin))+
  geom_point()+
  geom_smooth()+
  facet_wrap(~parameter, scales = "free")+
  ylab("Daily Load (kg) for ion")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# param daily concentration vs time ONLY SUMMER/FALL
ggplot(filter(all_data, month %in% c(7,8,9,10)),aes( x = date, y = conc, color = basin))+
  geom_point()+
  geom_smooth()+
  facet_wrap(~parameter, scales = "free")+
  ylab("Concentration of Ion (mg/L)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

load_sc_mod <- function(x){
  lm(daily_load_kg ~ SpecCond, data = x) %>%
    tidy()
}

load_model_sitewise <- all_data %>%
  group_by(basin, parameter)%>%
  nest() %>%
  mutate(mods=map(data,load_sc_mod))%>%
  unnest(cols = mods)%>%
  select(-data)%>%
  mutate(p.value = round(p.value, digits = 5))


load_model <- all_data %>%
  group_by(parameter)%>%
  nest() %>%
  mutate(mods=map(data,load_sc_mod))%>%
  unnest(cols = mods)%>%
  select(-data)%>%
  mutate(p.value = round(p.value, digits = 5))


kable(load_model)
parameter term estimate std.error statistic p.value
Bicarbonate (Intercept) 2.320793e+06 1.297157e+05 17.891383 0.00000
Bicarbonate SpecCond -1.674610e+03 1.481686e+02 -11.302053 0.00000
Calcium (Intercept) 6.263516e+05 3.358557e+04 18.649426 0.00000
Calcium SpecCond -1.092820e+02 2.160462e+01 -5.058271 0.00000
Chloride (Intercept) 4.131624e+05 2.806114e+04 14.723648 0.00000
Chloride SpecCond 2.231854e+01 1.808181e+01 1.234309 0.21772
Magnesium (Intercept) 1.661488e+05 8.548982e+03 19.434918 0.00000
Magnesium SpecCond -2.665124e+01 5.499311e+00 -4.846286 0.00000
Potassium (Intercept) 2.602272e+04 1.591992e+03 16.346011 0.00000
Potassium SpecCond -3.372231e+00 1.024082e+00 -3.292929 0.00107
Sodium (Intercept) 4.373280e+05 2.380720e+04 18.369573 0.00000
Sodium SpecCond -1.720191e+01 1.531448e+01 -1.123245 0.26193
Sulfate (Intercept) 1.378909e+06 7.123726e+04 19.356575 0.00000
Sulfate SpecCond -1.904079e+02 4.598990e+01 -4.140212 0.00004

SC has an inverse relationship with Q. As flows decrease, SC increases because there is still a large mass of ions in the water, but the water is less diluted and thus concentrations are higher. In practice, it would probably be best to use some flow-adjusted SC value to help determine how much ion mass is in the stream at a given time.
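The "adjusted SC" idea could be sketched as a simple flow normalization: regress log(SC) on log(Q) within each site and treat the residuals as flow-adjusted SC. A sketch assuming the all_data frame from above (the residual column name is my own):

```r
# Flow-adjusted SC: residuals of log10(SC) ~ log10(Q), fit per site
sc_adj <- all_data %>%
  filter(!is.na(SpecCond), !is.na(Flow), Flow > 0) %>%  # logs need complete, positive data
  group_by(SiteID) %>%
  mutate(sc_resid = resid(lm(log10(SpecCond) ~ log10(Flow)))) %>%
  ungroup()
```

Positive residuals then flag days with higher SC than the site's flow would predict, independent of dilution.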

I tried a quick look at load for each parameter (probably calculating load incorrectly by just doing concentration (mg/L) * flow (L/day)). It seems that each site has very different amounts of load, resulting in differing relationships between SC and the parameters. Most parameters still had a significant relationship with SC, but there was not a significant relationship between Cl and SC at the Colorado River sites.

In the final two graphs above, there is an interesting difference between concentration over time and load over time. Concentrations in the fall appear not to have changed much over the last decade, but when converted to loads there does appear to be a decrease over time for all parameters across all sites, other than Chloride and Sodium.
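That apparent load decline could be tested with the same nest/map/tidy pattern used earlier; a sketch assuming the all_data frame from above, regressing mean summer/fall load on year for each parameter:

```r
# Rough trend test: annual July-October mean load vs. year, per parameter
load_trends <- all_data %>%
  filter(month %in% c(7, 8, 9, 10), !is.na(daily_load_kg)) %>%
  group_by(parameter, year) %>%
  summarize(mean_load = mean(daily_load_kg), .groups = "drop") %>%
  group_by(parameter) %>%
  nest() %>%
  mutate(mods = map(data, ~ tidy(lm(mean_load ~ year, data = .x)))) %>%
  unnest(cols = mods) %>%
  filter(term == "year") %>%
  select(parameter, estimate, p.value)
```

A negative year estimate with a small p.value would support the declining-load reading of the plots.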